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ABSTRACT 

Motivation: Second generation sequencing technology makes it 
feasible for many researches to obtain enough sequence reads 
to attempt the de novo assembly of higher eukaryotes (including 
mammals). De novo assembly not only provides a tool for 
understanding wide scale biological variation, but within human 
bio-medicine, it offers a direct way of observing both large scale 
structural variation and fine scale sequence variation. Unfortunately, 
improvements in the computational feasibility for de novo assembly 
have not matched the improvements in the gathering of sequence 
data. This is for two reasons: the inherent computational complexity 
of the problem, and the in-practice memory requirements of tools. 
Results: In this paper we use entropy compressed or succinct 
data structures to create a practical representation of the de Bruijn 
assembly graph, which requires at least a factor of 10 less storage 
than the kinds of structures used by deployed methods. In particular 
we show that when stored succinctly, the de Bruijn assembly graph 
for homo sapiens requires only 23 gigabytes of storage. Moreover, 
because our representation is entropy compressed, in the presence 
of sequencing errors it has better scaling behaviour asymptotically 
than conventional approaches. 

Availability: Binaries of programs for constructing and traversing the 
de Bruijn assembly graph are available from 
http://www.genomics.csse.unimelb.edu.au/succinctAssembly 
Contact: tom.conway@nicta.com.au 



> . 1 INTRODUCTION 



A central problem in sequence bioinformatics is that of assembling 
genomes from a collection of overlapping short fragments thereof. 
These fragments are usually the result of sequencing - the 
determination by an instrument of a sampling of subsequences 
present in a sample of DNA. The number, length and accuracy 
of these sequences, varies significantly between the specific 
technologies, as does the degree of deviation from uniform 
sampling, and all these are constantly changing as new technologies 
are developed and refined. Nonetheless, it is typically the case that 
we have anywhere from hundreds of thousands of sequences several 
hundred bases in length to hundreds of millions of sequences a few 
tens of bases in length with error rates between 0.1% and 10%, 
depending on the technology. 



The two main techniques used for reconstructing the underlying 
sequence from the short fragments are based on overlap-layout- 
consensus models, and de Bruijn graph models. The former was 
principally used with older sequencing technologies which tend to 
yield fewer longer reads, and the latter has become increasingly 
popular with the so-called next generation sequencing technologies 
which yield man y more shorter s e quenc e fragments. Irrespective 
of the technique, Me dvedev et all 00071) shows that the problem 
of sequence assembly is computationally hard, and as the correct 
solution is not rigorously defined, all practical assembly techniques 
are necessarily heuristic in nature. It is not our purpose here to 
discuss the various assembly techniques - we restrict our attention to 
certain aspects of de Bruijn graph assembly - we refer the reader to 
iMilleref aZ.lbOlCh for a fairly comprehensive review of assemblers 
and assembly techniques. 

Space consumption is a pressing practical problem for assembly 
with de Bruijn graph based algorithms (we have observed velvet 
using 20 GB to assemble staphylococcus aureus - a 2.8 Mbp 
genome), and we present a representation for the de Bruijn assembly 
graph that is extremely compact. The representations we present use 
entropy compressed or succinct data structures. These are represent- 
ations, typically of sets or sequences of integers that use an amount 
of space bounded closely by the theoretical minimum suggested by 
the zero-order entropy of the set or sequence. These representations 
combine their space efficiency with efficient access. In some cases 
query operations can be performed in constant time, and in most 
cases they are at worst logarithmic. 

Suc cinct data structures are a basic building block; Ijacobsor] 
shows more complex discrete data structures such as trees 
and graphs can be built using them. So me of the tasks for which the y 
have been used i nclude Web graphs |Claude and Navarre] ( 2007)). 
XPath indexing dArrovuelo et all ( 2009)), parti al sum s (Hon et al\ 
( 2003)) and short read alignment fcimura et al\ d2009t» . 

1.1 de Bruijn Graphs & Assembly 

Let E be an alphabet, and |E| be the number of symbols in 
that alphabet. In the case of genome assembly, the alphabet E 
is {A, C, G, t}. The length of a string s of symbols drawn from E is 
written \s\. The notation s[i, j) is used for the substring of s starting 
at position i (counting from 0) to, but not including j. 
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The directed de Bruijn graph of degree k is defined as 



V* 



ty.,E.) 

|s : s € E fe j 

{(n f ,n t ) : n f ,n t G K;n/[l,/e) = n t [0,k - 1)} 



That is, the nodes of the de Bruijn graph V* correspond to all the k 
length strings over E and an edge exists between each pair of nodes 
for which the last k — 1 symbols of the first are the same as the first 
k — 1 of the second. 

The k length string labelling a node is usually referred to as 
a fc-gram in the computer science literature and a fc-mer in the 
bioinf ormatics literature. The labels of the edges, as noted in Good 
d 19461) . are k + 1-mers. For clarity, we use p = k + 1, and refer to 
edges as p-mers. 

We note that amongst the special properties of the de Bruijn graph 
is the fact that a given node can have at most |E| successor nodes: 
formed by taking the last k bases of the node and extending them 
with each of the symbols in the alphabet. That is, we can define the 
successors of a node n: 



succ t {n) = {n[l, k) ■ b : b <E £} (1) 
pred t (n) = {b ■ n[0, k - 1) : b G E} (2) 

To use the de Bruijn graph for assembly, we can build a subset 
of the graph by finding the nodes and edges that are supported by 
the information in the sequence reads. The edges are also annotated 
with a count of the number of times that a p-mer is observed in the 
sequence data. The counts are used for two purposes. The first is 
to distinguish edges that arise from sequencing errors (which will 
have very low counts) from those that arise from the underlying 
genome (which will have higher counts). The second is to estimate 
the number of copies of that edge in the underlying genome. 

Given a set of reads S, we can define a de Bruijn assembly graph, 
defining the nodes Vs in terms of the edges Es rather than the other 
way round, as we did above. To define the nodes, we create two 
(overlapping) sets: the set of nodes Fsfrom which an edge proceeds, 
and the set of nodes Ts to which an edge proceeds. 

Es = {si\j,j +p) : < j < \ Si \ -fc;V Si G S} (3) 

F s = {e[l,p+l):e€E s } 

T S = {e[0, p) : e G E s } 

Vs = FsUTs (4) 

G s = (Vs,E s ) (5) 

From the DNA alphabet and equation Q] a given node in the 
assembly graph can have at most 4 successor nodes, and by 
equation[2] a given node can also have at most 4 predecessor nodes. 

1.1.1 Reverse Complements An important distinction between 
ideal strings and the DNA sequences that are used in genome 
assembly is that the latter can be read in two directions: forwards, 
and in the reverse direction with the individual DNA letters 
exchanged with their Watson-Crick complements (A -o- T and C -R- 



G). In most sequencing scenarios, fragments of DNA are randomly 
sequenced in either direction, something that must be taken into 
account during assembly. First, sequence reads are processed twice 
- once reading them forwards, and then reading them in the reverse 
complement direction. Then, in most cases, nodes corresponding to 
reverse complement sequences are merged, and the edges are made 
bi-directed to match up the sequences correctly (see, for example 
iMedvedev et al\ J2007h ). For our current discussion, we will not 
combine them, but will store them separately. This makes the 
graph symmetric - a forward traversal corresponds to a backwards 
traversal on the reverse complement path, and vise versa. 
Figure[T]shows a de Bruijn assembly graph for a short string. 

1.1.2 From de Bruijn Assembly Graphs To Genomes The 
de Bruijn graph is both Euler ian and Hamiltonian, a property 
that lldurv and Waterman] d 19951) showed was useful for genome 
assembly. In principle, at least, the assembled sequence corresponds 
to an Eulerian tour of the de Bruijn assembly graph. The details 
of how this may be done in practice are beyond the scope of 
o ur current discussion, but the approaches include t hose described 



i Pevzner et al\ d200lh : Zerbino and Birnevl d2008l ): Hackson et al.\ 
d2009l) : ISimpson et a/.ld2009l) . Our current discussion is focussed on 



how we might represent the de Bruijn assembly graph in a practical 
program for performing large genome assembly. 

A simple approach to representing the de Bruijn assembly graph 
is to represent the nodes as ordinary records (i.e. using a struct 
in C or C++), and the edges as pointers between them. If we assume 
a node contains the k length substring (or fc-mer) represented as a 
64 bit integer (assuming k < 32), 32 bit edge counts and pointers 
to four possible successor nodes, and there are no memory allocator 
overheads, then the graph will require 56 bytes per node. In the 
drosophila melanogaster genome, with k = 25, there are about 
245 million nodes (including reverse complements), so we would 
expect the graph to take nearly 13 GB. For the human genome with 
k — 25 there are about 4.8 billion nodes (again, including reverse 
complements), so the graph would require over 250 GB. These data 
structures are large, but more is needed, because there is no way in 
what is described to locate a given node, so for instance a simple 
hash table (generously assuming a load factor of 1) might require 
an extra 16 bytes (hash value + pointer) per node or over 70 GB for 
the human genome. These figures are extremely conservative, since 
they ignore the effect of sequencing errors. 

We can get an estimate of the proportion of edges in the graph 
that are due to errors with a simple analysis. Most sequencing errors 
give rise to unique fc-mers, and hence many edges that occur only 
once. Ignoring insertion and deletion errors, for a given k (or p), a 
single error leads to p spurious edges, which, if we assume a random 
distribution of errors, are overwhelmingly likely to be unique. Thus, 
the number of spurious edges is proportional to the volume of 
sequence data, whereas the number of true edges is proportional 
to the genome size, and will converge on that number as the volume 
of sequence data increases. For example, consider the case of an 
organism with a 1 Mbp genome, which we sequence with sequence 
reads lOObp in length. If we assume that on average a read contains 1 
error, then with p = 26, we will typically have 74 true edges and 26 
spurious edges. Assuming the reads are uniformly distributed, once 
the number of reads exceeds about 14,000, almost all the 1 million 
true edges will be present, and there will be about 364,000 spurious 
edges. Beyond this, as the number of reads increases, the number of 
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Fig. 1. A de Bruijn assembly graph and its representation. 



Sequence: GCTTTCGACGTTTCA 
Reverse complement: TGAAACGTCGAAAGC 



(a) Source sequence 
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(b) The corresponding assembly graph. The edges are labelled with counts. The path marked 
with bold arrows is TTCGAC. 
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(c) Extracted p-mers with counts. 






1 


o • 


■ 1 ■ ■ 


110 


... j 





1 • 


■ 1 


1 •■• 


10 


... ! . 


• 1 ■•■ 


1 











1 ■ ■ 


10 10 


•■•0100 


< o 


e> 


H 


H 


fit! CJ 15 H 


o 


CD 


H 


EH 


fifi 


<! u cd eh 


EH 


Eh 


<; 


o 


cd 


Eh 




fit; U CD H 


fit; U CD EH 


u u 


o 


O 


CD 


< fit! <! < 


H 


Eh 


H 


EH 


fit 


o o o u 


H 


EH 


CD 


cd 


cd 


CD 




o o u o 


H H Eh H 


ft, < 




< 


O 


C5 CD CD CD 


C5 


CD 


C5 


H 


< 


fit; < < i< 


o 


Eh 


o 


cj 


o 


O 


CD 


H Eh Eh Eh 


Eh Eh Eh Eh 


fit! < 


;;i 


< 


< 


O O O O 


O 


U 


O 


O 


cd 


15 O CD CD 


cd 


CD 


H 


H 


En 


H 


Eh 


H Eh Eh Eh 


Eh Eh Eh Eh 



(d) The sparse bitmap representation. The gray boxes exemplify groups of edges that proceed from a single node. The arrows show the sequence TTCGAC. 
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(e) Dense array of counts. 



true edges will remain the same, but the number of spurious edges 
will continue to increase linearly. By the time the coverage (the 
expected count on all the true edges) reaches 40 (a typical coverage 
for genome assembly), we would expect to see about 14 million 
spurious edges. That is, the spurious edges would outnumber the 
true edges by a factor of 14. 

Figure|2]illustrates this problem. 

Much of this space is devoted to storing pointers, so the question 
naturally arises: are these pointers necessar y, or can they be 
avoide d? Existing asse mblers such as v elvet dZerbino and Birnevl 
J2008h ) and ABySS dSimpson et al\ d2009l) ) combine nodes 
corresponding to forward and reverse complements, and merge 
nodes on unbranched paths, and although these techniques 
significantly reduce the amount of memory required, they none the 
less represent an ad hoc approach to the problem of reducing the 
memory required to represent the de Bruijn assembly graph. 

2 APPROACH 

Our approach to memory-efficient representation of an assembly 
graph begins by reframing the question of whether the pointers in a 
naive graph representation are necessary. Rather we ask what infor- 
mation is necessary, and what is redundant or ephemeral. How many 
bits are required to represent the de Bruijn assembly graph from an 
information-theoretic point of view? 

The de Bruijn assembly graph is a subset of the de Bruijn graph. 
Of the |E| P edges in the de Bruijn graph, the assembly graph 
contains \Es\- The self-information of a set of edges that make up 



#Edges 




Total edges 
Error edges 

True edges 



#Reads 

Fig. 2. A sketch showing the relationship between the number of sequence 
reads and the number of edges in the graph. Because the underlying genome 
is fixed in size, as the number of sequence reads increases the number of 
edges in the graph due to the underlying genome will plateau when every 
part of the genome is covered. Conversely, since errors tend to be random 
and more-or-less unique, their number scales linearly with the number of 
sequence reads. Once enough sequence reads are present to have enough 
coverage to clearly distinguish true edges (which come from the underlying 
genome), they will usually be outnumbered by spurious edges (which arise 
from eiTors) by a substantial factor. 

an assembly graph, and hence the minimum number of bits required 
to encode the graph, is 

#bits = log (,5,) (6) 
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(Note, that unless otherwise specified, all logarithms are base 2.) 

For the de Bruijn assembly graph with k — 25, the human 
genome (build 37) yields 4,796,397,453 distinct edges, including 
reverse complements. By the equation above, taking S to be the 
genome itself: 

( 4 26 \ 
Mits = log « 1.2 GB 

fr s U, 796, 397, 453 J 

We do not need to store the nodes explicitly, since they are readily 
inferred from the edges: 

from-node(e) = e[0,p— 1) 
to-node(e) — e[l, p) 

Equation [6] gives a lower bound on the number of bits required 
to represent the de Bruijn assembly graph. We would like to 
find a concrete representation that comes close to that theoretical 
minimum while allowing efficient random access. The notion that 
the assembly graph is a subset of the de Bruijn graph immediately 
suggests that we could create a bitmap with a bit for each edge in 
the de Bruijn graph, and set the bits for the edges that occur in the 
assembly graph. Such a scheme depends on being able to enumerate 
the p-mers (i.e. the edges). This is done trivially by numbering the 
bases (we use A = 0, C = 1,G = 2 and T = 3), and interpreting 
the p symbols as an integer with 2p bits. Conceptually, then, we 
can create a bitmap with 4 P bits, and place Is in the positions 
corresponding to the edges in the assembly graph. 

Given such a bitmap, we can determine the successor set of a 
given node from the definition of the de Bruijn assembly graph, 
by probing the positions corresponding to the 4 edges that could 
proceed from the node. For a node corresponding to a fc-mer n the 
four positions in the bitmap are 4n, An + 1, 4n + 2 and A n + 3. 

The re is a particular formalism, first proposed by Ijacobsor] 
dl989h for querying sets of integers represented as bitmaps which 
is useful in this setting. Given a bitmap b with the positions of 
the set members set to 1 and the rest of the positions set to 0, 
the formalism uses two query operators rank and select with the 
following definitions Q 

rankb(p) = 

0<i<p 

select^(i) = max {p < n\ rankb(p) < i} 

Intuitively, rankb (p) is the number of ones in the bitmap b to the 
left of position p, and select^(i) is the position of the i-th set bit, 
where the set bits are numbered starting from zero. 

These operations are inverses in that rankb(selectb(i)) — i 
for i £ {Q...p — 1}, and selecti,(rank^,(j>)) — p for p £ 
{p:pe{0...v~l};b p = l}. 

Using the rank/select formalism, we can compute the set of the 
ranks of the successor edges for a node n efficiently given a bitmap 



1 The literature contains several slightly different definitions that arise from 
different conventions for subscripting arrays - mathematical literature tends 
subscript from one; computer science literature from zero. We use the latter. 



representing the set of edges: 

succs(n) = {r £ [rankE s {An),rankE s (An + A))} 

This forms the basis of a method for efficient traversal of a de Bruijn 
assembly graph represented as a set of integers (i.e., a bitmap). 

Next we consider how the edge counts should be represented. For 
this we draw on the rank/select formalism again, and note that while 
the edges are sparse (a point that we will come back to shortly), the 
ranks of the edges are dense, filling the range [0, Therefore 
we can store the edge counts in a vector of 32 bit integers. 

3 METHODS 

The preceding discussion presented a technique for representing a 
de Bruijn assembly graph as a bitmap using 4 P bits. For a typical 
value of p — 26 (i.e. k = 25), the bitmap would require 512TB. 
This is clearly infeasible (and larger k would be worse), but the 
bitmap is incredibly sparse. Of the 4.5 x 10 15 bits, for the human 
genome, only 4.7 x 10 9 are 1. That is, the fraction of the bits that 
are set is 10 -8 , so a sparse representation should be used. In fact, 
Equation [6] gives a precise lower bound on the number of bits that 
a sparse representation requires, and there has been a large amount 
of research in the last two decades on the efficient representation of 
data structures that are close to the theoretical limit. 

Let Bv :ll b e the set of bitm aps with v bits, where exactly 
(i bits are set. Ijacobsor] (l989) defines a succinct representation 
as a way of mapping the elements of Bv tfl into a read-only 
memory such that the amount of space used to represent a bitmap 
is close to (1 + o(l)) log bits. A succinct data structure 

is a succinct representation which also supports desired query 
operations efficiently. "Efficiently" can mean either low asymptotic 
complexity, or practical speed on real hardware. In our case, the 
query op erations that we wish to support are rank and select. 
Although Jacobson ( 1989) defines su ccinct data structures as read - 
only obiects. lRaman et aZld200ll) and lMakinen and Navarro! 1 2008), 
amongst others, show how insert and delete can be implemented 
without sacrificing the s uccinct nature of the representa tion. A 
summary (abstracted from lOkanohara and S adakane (2006)) of the 
data structures that we use are shown in Table [T] 

The darray and sarray data structures dOkanohara an d Sadakane 
(2006)) are optimised for the case when the bitmap is "dense" or 
"sparse" respectively. If p ~ I//2, log ~ v, so storing the 
uncompressed bitmap is optimal; in this case, the bitmap is dense, 
and so rank and select can be implemented with o(y) extra space 
to speed up those operations. If pjv is small, then the bitmap 
is sparse; in this case, the bitmap can be compress ed close to 
optimal space using Elias-Fano coding teliasl d 1974b ), which is 
the basis for sarray. The other main data structure that we use 
is rrrarray dRaman et ~ai\ d2007h ). which uses space very close 
to optimal over the entire range of values of pjv, with moderate 
overhead in space usage. 

To create a representation of the de Bruijn assembly graph, we 
extract the p-mers from the input sequences, accumulating them in 
RAM and when RAM is "full", sorting them (retaining duplicates) 
and writing the sorted run to disk. Once all the p-mers have been 
accumulated into sorted runs, the runs are then binary merged, 
and the final list is processed, counting duplicate p-mers to yield a 
sequence of {edge, count) pairs which are used to construct a sparse 
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Table 1. Summary of succinct data structures. 



Method Size (bits) rank complexity select complexity 



darray v + o(u) O(l) 0(log 4 fj,/ log v) 

sarray a* log £ + 1.92a* + o(ji) 0(log f) + OQog A fi/hgu) 0(logV/log^) 

rrrarray log Q + o{/i) + 0(log log v) O(l) O(l) 



array (i.e. sarray), and a vector of edge counts. The merging phase 
uses log TV passes, merging pairs of sorted runs. 

Returning to the representation of the edge counts, in Section [2] 
we suggested storing the counts in a vector of 32 bit integers indexed 
by edge rank. This actually uses much more memory than necessary. 
As previously noted, prior to error removal, a vast majority of edges 
in the graph are spurious and will have a very low edge count. Most 
of the true edges have modest counts also: edges that are unique 
in the underlying genome will have a count somewhere around the 
basic coverage (e.g. 15-50). For most edges 8 bits of storage is 
sufficient, and for most of the remainder 16 bits is sufficient. Only 
a handful of edges, in practice, need more than 16 bits. Therefore 
using 32 bits for every edge is very wasteful. 

There are many techniques for c reating compressed repres- 
entations of vectors of integers (see [Moffat an d Turninl |2002)), 
but in most cases they do not provide efficient random access. 
Succinct data structures impl ementing rank/select yield an effective 
technique first introduced by Brisabo a et al\ d2009h . We split each 
count into the three parts alluded to above: the least significant 
8 bits, the "middle" 8 bits and the most significant 16 bits. We 
store the least significant 8 bits in a dense vector of bytes L. 
Corresponding to it, we store a succinct bitmap Bl with a 1 marking 
those entries for which the middle 8 bits, or the most significant 16 
bits are nonzero. In a dense vector of bytes M (indexed by rank 
in Bl) we store the middle 8 bits of those entries for which a 1 
exists in Bl- Corresponding to M, we store a sparse bitmap Bm 
with a 1 marking those entries for which the most significant 16 
bits are nonzero. Finally, we have a dense vector if 16 bit words H 
(indexed by rank in Bm) with the most significant bits of those 
entries marked in Bm- The bitmaps Bl and 73a/ are represented 
using rrrarray. 

4 RESULTS 

We have created a program that uses straightforward implement- 
ations of the succinct data structures we have described to build 
de Bruijn assembly graphs for the genomes of the organisms listed 
in Table [2] All the reference genomes were obtained from the 
NCBI archive. The number of edges (which include duplicates, 
but exclude reverse complements) are marginally different to the 
genome sizes reported in the literature (which themselves vary) 
because the edges do not include undetermined bases represented 
as Ns in the genomes, and the size of the genome builds do not 
correspond exactly to the estimated genome size. 

In Table ?? we report the size of the graph and multiplicities data 
for the de Bruijn assembly graph constructed over the reference 
genomes. We also report the graph construction time on a server 
with 8 cores running at about 2 GHz, with 32 GB RAM, of which 



Table 2. Genomes used for testing with the number of distinct edges 
(excluding reverse complements, but including duplicates) as a measure of 
the genome size, the size of the assembly graph (including the edge counts) 
in bytes, and the time, in seconds, taken to build the graph. 



Organism 


#Edg 


cs 


Size 




Time 


mycobacterium leprae 


3.2 


X 


10 6 


3.3 


X 


10 7 


5 


thalassiosira pseudonana 


3.1 


X 


10 7 


3.2 


X 


10 8 


50 


caenorbahditis elegans 


1.0 


X 


10 8 


9.8 


X 


10 8 


154 


arabidopsis thaliana 


1.2 


X 


10 s 


1.2 


X 


10 9 


187 


drosophila melanogaster 


1.6 


X 


10 s 


1.3 


X 


10 9 


317 


oryza sativa 


3.7 


X 


10 8 


3.0 


X 


10 9 


428 


danio rerio 


1.5 


X 


10 9 


1.1 


X 


10 10 


5,448 


mus musculus 


2.5 


X 


10° 


2.2 


X 


10 10 


18,546 


homo sapiens 


2.9 


X 


10 9 


2.5 


X 


10 10 


14,235 



the graph construction process used about 2 GB. We consistently 
find that our code results in graphs requiring about 5.2 bytes per 
edge, including the storage for the edge multiplicities which is less 
than the 8 bytes for storing single pointer on a 64-bit architecture. 
There is a greater degree of variability in the running time than there 
is in the sizes of the assembly graphs, with the two largest genomes 
(mouse and human) being the slowest (when weighted by genome 
size). This is partly due to the log TV on-disk passes required by the 
binary merge used for the graph construction, and also because the 
disks were shared on a cluster, and the longer runs will have suffered 
some degree of interference. 

It is difficult to compare fairly and directly to existing tools, 
but to give some comparison, we tried running velvet (also with 
k = 25). Using synthesized error free reads, at 30 times coverage, 
our 32 GB server was only able to assemble the smallest of these 
genomes (mycobacterium leprae which is about 3.2 Mbp) with an 
observed peak memory usage of about 325 MB. The next smallest 
(thalassiosira pseudonana) failed when the process (velvetg) 
exhausted main memory. The graph and edge counts using our 
representation required 32 MB. 

To test the speed of the random access in the graph, we used 
a program that performs depth first traversal to find the set of 
paths in the graph that do not contain branches. We ran it on 
the graph for the thalassiosira organism with k = 25, which 
contains 60,312,974 edges, and it took 202 seconds. Each node 
is visited approximately once, and at each node the code does 4 
rank operations to determine the number of incoming and outgoing 
edges. Thus, the program is performing approximately 1.2 million 
rank operations per second. It is rather difficult to estimate how 
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this would compare to a pointer based implementation, but we 
would expect a pointer based implementation to be up to an order 
of magnitude faster. We note, however, that our implementation 
has not been highly tuned, and in any case, the compactness of 
our representation makes the thalassiosira genome fit in memory 
(easily), requiring 302 MB for the graph and about 4 MB for the 
remaining structures required for the graph traversal. 



5 DISCUSSION 

The analysis, presented in Section [2] suggested that for the human 
genome, we would require a minimum of 1.2 GB, but in our 
representation, we use 20 GB. We expect the indexes that support 
the rank and select operations to take some space, but the difference 
is more than an order of magnitude. The explanation lies in an 
important detail of the implem entation of the sparse array from 
Okanohara and Sadakane (2006). The minimum space consumption 
calculation has two parameters: u, the number of positions in the 
bitmap; and /i the number of positions set to 1. In our computation 
we took v — 4 P , but the implementation, which is (necessarily) 
built around machine word sizes takes v = 2 64 . If we recompute 
the minimum number of bits required under that assumption, we 
have: 



■j^bits = log 



4, 796, 397, 453 



19 GB 



Further research is necessary to build an implementation of the 
sparse array that allows us to set fj, at values closer to 4 P when p 
is less than 32. It may be possible at values of v = 2' +J where i 
and j are machine words sizes (e.g. i = 32 and j — 16 would allow 
P = 24). 

The techniques we have presented are by no means the only way 
to reduce the memory requirements of de Bruijn graph assembly. 
Another approach is to use a hash table that maps from fc-mer to 
a record containing the counts on the 4 possible successor edges. 
Since the 4 successor fc-mers are trivially derived from the current 
fc-mer, we can store just this map. Moreover, this technique could 
use a variable length coding technique like the one we use to store a 
single byte for each of the counts in most cases. For such a structure 
to be useful, we would need an open addressing hash table, to 
avoid an indirection layer of pointers (as would be required for a 
separate chaining hashing scheme). Further, for a hashing scheme 
to be competitive, we would need to get the load factor close to 
1.0. Exactly suc h a hashing scheme, c alled cuckoo hashing has 
been proposed by Pagh and Rodler (20oT|), wit h several refinements , 
including those proposed bv lRossI d2007 ). and lFotakis et al.\ d2003h . 
The latter in particular shows a variant that allows the load factor 
of the hash table to approach 1.0 while retaining efficient access. 
To make a rough comparison to our approach, let us consider the 
human genome with 4.7 billion edges. The hashing scheme we have 
outlined uses 8 bytes for the fc-mer, and 1 byte for each of the counts. 
Since only a tiny minority of edges will require more than 1 byte, 
this will give us a close approximation to the space usage. Thus, 
each edge requires 12 bytes. Assuming a load factor of 90%, which 
is likely to be close to the upper limit in practice, this representation 
would require about 60 GB, which is a significant improvement on 
the pointer based implementation, but is not nearly as efficient as 
the representation we have proposed. 



An important property of our representation, compared to either 
of those outlined above, is that ours is succinct. That is, in a 
formal sense, ours uses within a small constant factor, the minimum 
possible space. 

We have presented a practical and efficient representation of the 
de Bruijn assembly graph, but of course there is much more to 
doing de novo assembly with de Bruijn graph methods. Although 
we have sketched the space issues caused by sequencing errors, we 
have not addressed the detection and correction of errors. Also, 
a combinatoric number of Eulerian paths exist in the de Bruijn 
assembly graph, among which true paths must be identified. This 
is usually done in the first instance by using the sequence reads to 
disambiguate paths. In the second instance, this is done by using 
paired sequence reads (e.g. paired-end and mate-pair sequence 
reads), in a process usually called scaffolding. The algorithms 
described in the literature can either be implemented directly on our 
representation, or in most cases, adapted. One important caveat is 
that our representation depends on the properties of the de Bruijn 
graph (i.e. the relationship between nodes and the edges that connect 
them), and while edges may be added or removed, the representation 
cannot be treated as an arbitrary graph - duplicating or arbitrarily 
merging parts of the graph. We do not believe this is a significant 
obstacle to building a complete assembler (which we are doing) 
based on this representation. 

As well as building a practical assembler based on the 
representation we have presented, there are several opportunities 
for improving the graph construction. At the moment, the run- 
time is dominated by sorting, which is done sequentially, and with 
fairly generic sorting code. We speculate that the sequential sorting 
speed could be doubled with modest effort, and the whole could be 
parallelized fairly easily. 

5.1 A Succinct Representation of Sequence Reads 

Among the several components required for a practical assembler 
mentioned above, the use of reads during assembly is worthy 
of some further examination. A practical assembler will use the 
sequence reads to help disambiguate conflations in the de Bruijn 
graph. Here we present a simple technique that uses succinct data 
structures to form a compact representation of the sequence reads, 
given the de Bruijn assembly graph. 

The de Bruijn graph already contains most of the information 
present in the sequence reads. Each sequence read corresponds to 
a path in the de Bruijn assembly graph. The information present in 
the sequence reads that is not present in the graph is: (0 where in the 
graph the sequence read starts; (ii) where in the graph it ends, or its 
length; and (Hi) at nodes in the graph where there is more than one 
out-going edge, which edge should be followed. 

If we sort the sequence reads (discarding the order of the reads), 
we can efficiently store the initial fc-mer of each read, and, moreover 
construct an efficient index that lets us determine which reads begin 
with a given fc-mer. The lengths of the reads can be stored efficiently 
by creating a sparse bitmap corresponding to the concatenation of 
all the sequence reads, with a 1 denoting the start of a sequence 
read (rrrarray would be a logical choice for such a bitmap). The 
rank and select functions give an efficient means of determining the 
position in the bitmap of the start and end of a given read. 

The sequence of choices, or the path that the sequence read 
follows may be encoded very efficiently in the following way. At 
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each node, we can number the extant out-going edges [0, 3], and 
assign a rank to the edge taken by a given sequence read. The 
ranks may be assigned lexicographically, or in order of edge count 
(highest to lowest). These ranks require two bits, which we can 
store in a pair of sparse bitmaps - one for the least significant 
bit, and one for the most significant bit. The positions in these 
bitmaps correspond to the positions in the bitmap marking the initial 
positions of sequence reads. In practice, a large majority of nodes 
have only one out-going edge, so the rank will be 0, hence the 
bitmaps will be sparse. Most of the nodes which have more than 
one out-going edge have only two, so in the vast majority of cases, 
the most significant bit of the rank will be zero, making the bitmap 
for the most significant bit even more sparse than the one for the 
least significant bit. 

If one wished to use this encoding to encode sequence reads other 
than those represented in the de Bruijn assembly graph, then it is 
no longer the case that every sequence read corresponds to a path 
in the graph. In this case, a "nearest" path could be found, and 
the differences between the sequence read and the path could be 
recorded. This could be done using a sparse bitmap to record those 
positions at which the path and the sequence read diverge, and a 
corresponding vector (indexed by rank in the said bitmap) of bases 
could be used to store the actual base in the sequence read. There 
is an optimization problem to find the "nearest" path, but simple 
heuristics are likely to be sufficient. 

This scheme could be generalized for sequencing technologies 
where we may wish to expli citly encod e gaps in the sequence read, 
for example strobe reads by the use of an 

auxiliary bitmap marking the locations of the gaps. This would be 
an interesting line for further research. 

6 CONCLUSION 

We have presented a memory-efficient representation of the 
de Bruijn assembly graph using succinct data structures which allow 
us to represent the graph in close to the minimum number of bits. 
We have demonstrated its effectiveness on a number of genomes 
from bacterial to mammalian scale, including the human genome. 
Further work will build on this to produce a practical assembler. 
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